April 17, 2025
\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]
Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)
Success:
Density estimation: Given a proposed data point, \(\mathbf x_i\), what is the probability with which we could expect to see that data point? Don’t generate data points that have low probability of occurrence!
Sampling: How can we generate novel data from the model distribution? We should be able to sample from the distribution!
Representation: Can we learn meaningful feature representations from \(\mathbf x\)? Do we have the ability to exaggerate certain features?
So far, we’ve talked about three types of generative models.
Autoregressive Models
\[ P(\mathbf x) = \prod \limits_{t = 1}^T P(x_t | x_1,x_2,...,x_{t-1}) \]
Advantages:
Directly compute and maximize \(P(\mathbf x)\)
Generates high quality images due to its pixel-by-pixel generation strategy
Disadvantages:
Very slow to train
Very slow to generate high res images
No explicit latent code
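The factorization above is easy to make concrete: the joint log-probability is just a sum of conditional log-probabilities. A minimal sketch on a binary sequence (the `toy_cond` conditional model is a made-up illustration, not anything fit to data):

```python
import numpy as np

def autoregressive_log_prob(x, cond_prob):
    """log P(x) = sum_t log P(x_t | x_1, ..., x_{t-1})."""
    logp = 0.0
    for t in range(len(x)):
        logp += np.log(cond_prob(x[:t], x[t]))  # P(x_t | history)
    return logp

# Toy conditional: P(x_t = 1) = 0.9 if the previous symbol was 1, else 0.2
def toy_cond(history, xt):
    p_one = 0.9 if (len(history) > 0 and history[-1] == 1) else 0.2
    return p_one if xt == 1 else 1.0 - p_one

# P([1, 1, 0]) = 0.2 * 0.9 * 0.1
logp = autoregressive_log_prob([1, 1, 0], toy_cond)
```

Generation would run the same loop forwards, sampling each \(x_t\) from its conditional, which is exactly why producing a high resolution image one pixel at a time is slow.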
Variational Autoencoders
\[ P(\mathbf x) \ge E_{Q}[\log P(\mathbf x | \mathbf z)] - D_{KL}(Q(\mathbf z | \mathbf x) || P(\mathbf z)) \]
\[ P(\mathbf x) = \int P(\mathbf x | \mathbf z) P(\mathbf z) dz \]
Advantages:
Fast image generation
Very rich latent codes
Disadvantages:
Maximizing on a lower bound, not necessarily close to the truth
Generated images often blurry due to averaging behavior
GANs
\[ \underset{\theta}{\text{min }} \underset{\phi}{\text{max }} \frac{1}{2} E_{P(\mathbf x)}[\log D_{\phi} (\mathbf x)] + E_{Q(\mathbf z)}[\log(1 - D_{\phi}(g_{\theta}(\mathbf z)))] \]
\(g_{\theta}(\mathbf z)\) is a generator network that creates fake images
Train a discriminator to separate real and fake images
Maximize the power of the discriminator while also minimizing the JS divergence between the real and fake image distributions
Advantages:
Fast image generation
Performs arbitrarily well due to lack of distributional assumptions
Good performance in image editing
Disadvantages:
Minimal control over generation
Training is a nightmare
Today, we’re going to quickly cover the new hotness - diffusion models
Diffusion models are like a cross between VAEs and autoregressive models
Conceptually, diffusion models are quite simple.
Start with an initial image, \(\mathbf x\)
Sequentially add normal random noise to the image pixels
Repeat this process until the resulting image is a collection of random pixels
Learn a reverse mapping that decodes the random image back to its initial state
Seems simple…
Why this would work with a collection of images is a bit of a thinker.
Think about the decoder model:
Learn the best mapping from “random” back to the original image
Over a collection of images, think about the first move - try to recover a low level commonality among all images
Conditional on the first move, learn another low level commonality
Eventually, recover the original image up to arbitrary precision with enough decoder conditioning!
Two parts:
A prespecified (but stochastic) encoder that maps images to random space
A learnable decoder that inverts the encoder
The decoder is the hard part!
Let’s attack these in order.
Diffusion Encoder
For notational simplicity, let \(\mathbf x = \mathbf z_0\)
Define a sequence of latent variables over \(T\) periods
\[ \{\mathbf z_0, \mathbf z_1, ..., \mathbf z_T\} \]
The forward process:
\[ \mathbf z_{t} = \sqrt{1 - \beta_t} \mathbf z_{t-1} + \sqrt{\beta_t} \epsilon_t \]
\(\beta_t\) is a mixing value between 0 and 1
\(\epsilon_t\) is a noise draw from a standard normal distribution (most frequently all pixels are assumed independent)
The first term attenuates the input (e.g. original signal) and the second term blends in noise!
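A single forward step is one line of numpy. A minimal sketch (the image shape and \(\beta_t\) value here are arbitrary choices):

```python
import numpy as np

def forward_step(z_prev, beta_t, rng):
    # Attenuate the signal, then blend in independent standard normal noise
    eps = rng.standard_normal(z_prev.shape)
    return np.sqrt(1.0 - beta_t) * z_prev + np.sqrt(beta_t) * eps

rng = np.random.default_rng(0)
z0 = rng.standard_normal((8, 8))   # stand-in for an 8x8 "image"
z1 = forward_step(z0, beta_t=0.02, rng=rng)
```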
More useful specification:
\[ P(\mathbf z_t | \mathbf z_{t-1}) = \mathcal N(\mathbf z_t | \sqrt{1 - \beta_t} \mathbf z_{t-1} , \beta_t \mathcal I) \]
A sequence of conditional distributions relating each latent variable with the last
The variance of each RV is a function of a fixed set of mixing values
Thus, we can define a joint distribution over all latent variables given an input \(\mathbf z_0\) as:
\[ P(\mathbf z_1,\mathbf z_2,...,\mathbf z_T | \mathbf z_0) = \prod \limits_{t = 1}^T P(\mathbf z_t | \mathbf z_{t-1}) \]
A Markovian process
Our conditional chain reduces to one step behind
We could use this direct sequential process to encode images, but this would be slow at scale
Fortunately, we can express the expected distribution given \(\mathbf z_0\) at any time \(t\) without needing to marginalize!
\[ \mathbf z_2 = \sqrt{1 - \beta_2} \mathbf z_1 + \sqrt{\beta_2} \epsilon_2 \]
\[ \mathbf z_1 = \sqrt{1 - \beta_1} \mathbf z_0 + \sqrt{\beta_1} \epsilon_1 \]
Omitting the algebra:
\[ \mathbf z_2 = \sqrt{(1 - \beta_2)(1 - \beta_1)} \mathbf z_0 + \sqrt{1 - \beta_2 - (1 - \beta_2)(1 - \beta_1)} \epsilon_1 + \sqrt{\beta_2} \epsilon_2 \]
The last two terms are independent normal draws, so their sum is itself normal with variance equal to the sum of the two variances:
\[ \sqrt{1 - \beta_2 - (1 - \beta_2)(1 - \beta_1)} \epsilon_1 + \sqrt{\beta_2} \epsilon_2 \sim \mathcal N\left(0 , 1 - (1 - \beta_2)(1 - \beta_1)\right) \]
so the pair can be replaced by a single standard normal draw \(\epsilon\) scaled as:
\[ \sqrt{1 - (1 - \beta_2)(1 - \beta_1)} \, \epsilon \]
meaning that our update equation becomes:
\[ \mathbf z_2 = \sqrt{(1 - \beta_2)(1 - \beta_1)} \mathbf z_0 + \sqrt{1 - (1 - \beta_2)(1 - \beta_1)} \epsilon \]
More broadly, let
\[ \tilde{\beta}_t = \prod \limits_{s = 1}^t (1 - \beta_s) \]
Then, we can define:
\[ P(\mathbf z_t | \mathbf z_0) \sim \mathcal N \left(\mathbf z_t \mid \sqrt{\tilde{\beta}_t} \mathbf z_0 , (1 - \tilde{\beta}_t ) \mathcal I \right) \]
We know the conditional distribution of the latent variable at any \(t\) given the input!
Called the diffusion kernel
Note that the mean necessarily goes to zero and the variance approaches identity!
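In numpy, sampling from the diffusion kernel jumps straight from \(\mathbf z_0\) to \(\mathbf z_t\) without stepping through the chain. A sketch (the constant schedule is an arbitrary choice):

```python
import numpy as np

def diffusion_kernel_sample(z0, betas, t, rng):
    # beta_tilde_t = prod_{s<=t} (1 - beta_s): how much signal survives to step t
    beta_tilde = np.prod(1.0 - betas[:t])
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(beta_tilde) * z0 + np.sqrt(1.0 - beta_tilde) * eps

rng = np.random.default_rng(0)
betas = np.full(1000, 0.02)
# After many steps essentially no signal remains: mean ~ 0, variance ~ 1
z_late = diffusion_kernel_sample(np.ones(16), betas, t=1000, rng=rng)
```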
Key note: we’ve “noised” the image, but we can track back to the original given the structure of the noise!
This means that I could, theoretically, learn a mapping back from noise to the original image!!!
At each time point, \(\mathbf z_t\) represents a more and more noised version of the input image
All images converge to the same noise distribution
The path is what tells us how to go from stable point back to image!
Regardless of where we start, we end up at \(\mathbf z_T \sim \mathcal N(\mathbf 0 , \mathcal I)\)!
Defines a full diffusion generator:
Each image is a draw from \(\mathbf z_T\)
Map through the backwards process
Get an image in the pixel space
The encoder defines a sequence of distributions:
\[ P(\mathbf z_t | \mathbf z_0) \text{ or } P(\mathbf z_t | \mathbf z_{t-1}) \]
Thus, the decoder should reverse these conditionals:
\[ P(\mathbf z_0 | \mathbf z_t) \text{ or } P(\mathbf z_{t-1} | \mathbf z_t) \]
It may seem like this is going to be easy, but think back to QTM 110 and Bayes’ theorem:
\[ P(\mathbf z_{t-1} | \mathbf z_{t}) = \frac{P(\mathbf z_{t} | \mathbf z_{t-1})P(\mathbf z_{t-1})}{P(\mathbf z_{t})} \]
Unfortunately, we don’t know the two marginals!
\[ P(\mathbf z_t) = \int \int ... \int P(\mathbf z_t | \mathbf z_{t-1})P(\mathbf z_{t-1} | \mathbf z_{t-2})...P(\mathbf z_{1} | \mathbf z_{0}) P(\mathbf z_0) \, d \mathbf z_0 \, d \mathbf z_1 ... d \mathbf z_{t-1} \]
Important point:
Even though the forward direction conditional will be normal, by assumption, the reverse probably won’t be!
Because we assume normality in one direction, we must account for the data distribution in the other!
We can simulate this in a one-dimensional model to see it.
However, there is one reverse conditional that we can know:
\[ P(\mathbf z_{t - 1} | \mathbf z_t , \mathbf z_0) \]
The reverse conditional (e.g. backwards process) conditional on the original input!
Not useful when there is no original input
But, will be useful for training
\[ P(\mathbf z_{t-1} | \mathbf z_{t} , \mathbf z_0) = \frac{P(\mathbf z_t | \mathbf z_{t-1} , \mathbf z_0)P(\mathbf z_{t-1} | \mathbf z_0)}{P(\mathbf z_t | \mathbf z_0)} \propto P(\mathbf z_t | \mathbf z_{t-1} , \mathbf z_0)P(\mathbf z_{t-1} | \mathbf z_0) \]
\[ P(\mathbf z_{t-1} | \mathbf z_{t} , \mathbf z_0) \propto P(\mathbf z_t | \mathbf z_{t-1})P(\mathbf z_{t-1} | \mathbf z_0) \]
We know both of these by definition of the forward process:
\[ \mathcal N \left( \mathbf z_t \mid \sqrt{1 - \beta_t}\mathbf z_{t-1} , \beta_t \mathcal I \right) \mathcal N\left(\mathbf z_{t-1} \mid \sqrt{\tilde{\beta}_{t-1}} \mathbf z_0 , (1 - \tilde{\beta}_{t-1}) \mathcal I \right) \]
Through some normal Bayes magic, we can show that this distribution can be cast in terms of \(\mathbf z_{t-1}\):
\[ P(\mathbf z_{t-1} | \mathbf z_t , \mathbf z_0) = \mathcal N\left(\mathbf z_{t-1} \mid \frac{1 - \tilde{\beta}_{t-1}}{1 - \tilde{\beta}_{t}} \sqrt{1 - \beta_t} \mathbf z_t + \frac{\sqrt{\tilde{\beta}_{t-1}}\beta_t}{1 - \tilde{\beta}_t} \mathbf z_0 , \frac{\beta_t (1 - \tilde{\beta}_{t-1})}{1 - \tilde{\beta}_t} \mathcal I \right) \]
In words, the reverse conditional conditioned on the original input is also normal!
Given an input value, \(\mathbf z_t\), and a mixing rate, we can assess the likelihood of the previous latent variable!
Note that all of the terms with \(\beta\) are pre-determined and set!
We can “denoise” an image!
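Since every \(\beta\) term is fixed up front, the posterior above is easy to evaluate numerically. A sketch (1-indexed \(t\), arbitrary constant schedule):

```python
import numpy as np

def reverse_posterior(z_t, z0, betas, t):
    # P(z_{t-1} | z_t, z_0) is normal; return its mean and (scalar) variance
    beta_t = betas[t - 1]
    bt = np.prod(1.0 - betas[:t])        # beta_tilde_t
    bt1 = np.prod(1.0 - betas[:t - 1])   # beta_tilde_{t-1}
    mean = ((1 - bt1) / (1 - bt)) * np.sqrt(1 - beta_t) * z_t \
         + (np.sqrt(bt1) * beta_t / (1 - bt)) * z0
    var = beta_t * (1 - bt1) / (1 - bt)
    return mean, var

betas = np.full(100, 0.05)
mean, var = reverse_posterior(z_t=np.zeros(4), z0=np.ones(4), betas=betas, t=10)
```

Note that the variance is always a bit smaller than \(\beta_t\), since \(1 - \tilde{\beta}_{t-1} < 1 - \tilde{\beta}_t\).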
This diffusion model defines a diffusion mapping:
\[ P(\mathbf z_T) \sim \mathcal N(\mathbf 0 , \mathcal I) \]
\[ P(\mathbf z_{t-1} | \mathbf z_t, \mathbf z_0) \sim \mathcal N\left(\mathbf z_{t-1} \mid \frac{1 - \tilde{\beta}_{t-1}}{1 - \tilde{\beta}_{t}} \sqrt{1 - \beta_t} \mathbf z_t + \frac{\sqrt{\tilde{\beta}_{t-1}}\beta_t}{1 - \tilde{\beta}_t} \mathbf z_0 , \frac{\beta_t (1 - \tilde{\beta}_{t-1})}{1 - \tilde{\beta}_t} \mathcal I \right) \]
Big problem: this reverse conditional depends on \(\mathbf z_0\), the very image we're trying to generate!
Instead:
\[ Q(\mathbf z_{t-1} | \mathbf z_t, \boldsymbol \theta_t) \sim \mathcal N \left(\mathbf z_{t-1} \mid g(\mathbf z_{t}, \boldsymbol \theta_t) , \sigma^2_t \mathcal I \right) \]
Approximate the true reverse mapping without relying on \(\mathbf z_0\)
Come up with an approximate distribution that is as close as possible to \(P(\mathbf z_{t-1} | \mathbf z_t, \mathbf z_0)\) without explicitly leveraging info about \(\mathbf z_0\) until the end of the diffusion process
Each time step is parameterized by its own set of values that map \(\mathbf z_t\) to \(\mathbf z_{t-1}\)
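Once the per-step means are learned, generation is ancestral sampling: draw \(\mathbf z_T\) from the prior and walk backwards through the learned conditionals. A sketch with a stand-in "network" (the shrinking lambda is purely illustrative, not a trained model):

```python
import numpy as np

def ancestral_sample(g, sigmas, T, shape, rng):
    z = rng.standard_normal(shape)                 # z_T ~ N(0, I)
    for t in range(T, 0, -1):
        mean = g(z, t)                             # learned mean g(z_t, theta_t)
        noise = rng.standard_normal(shape) if t > 1 else 0.0  # no noise on the last step
        z = mean + sigmas[t - 1] * noise           # z_{t-1} ~ N(mean, sigma_t^2 I)
    return z

rng = np.random.default_rng(0)
img = ancestral_sample(lambda z, t: 0.9 * z, sigmas=np.full(10, 0.1), T=10,
                       shape=(4, 4), rng=rng)
```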
What we would like to optimize:
\[ \hat{\boldsymbol \theta}_{1,2,...,T} = \underset{\boldsymbol \theta}{\text{argmax }} \left[\sum \limits_{i = 1}^N \log P(\mathbf x_i | \boldsymbol \theta_{1,2,...,T})\right] \]
where:
\[ P(\mathbf x | \boldsymbol \theta_{1,2,...,T}) = \int P(\mathbf x , \mathbf z_1, ... , \mathbf z_T | \boldsymbol \theta_{1,2,...,T}) d \mathbf z_1 d \mathbf z_2... d \mathbf z_T \]
This is intractable.
Any ideas?
As with VAEs:
Let \(Q(\mathbf z_1, \mathbf z_2, ... , \mathbf z_T | \mathbf x)\) be an approximation to the intractable reverse conditional, \(P(\mathbf z_1,\mathbf z_2, ..., \mathbf z_T | \mathbf x)\).
Maximize the evidence lower bound:
\[ \int Q\left (\mathbf z_{1...T} | \mathbf x \right) \log \left[\frac{P(\mathbf x , \mathbf z_{1...T} | \boldsymbol \theta_{1...T})}{Q(\mathbf z_{1...T} | \mathbf x)}\right] d \mathbf z_{1...T} \]
This ELBO is a little different than the VAE one since we need to deal with a sequence of latent variables instead of a single vector!
Withholding the math (mostly relying on the fact that the latent variables form a Markov chain), we can show that the ELBO has the form:
\[ E_Q \left[\log P(\mathbf x | \mathbf z_1 , \boldsymbol \theta_1)\right] - \sum \limits_{t = 2}^T E_Q \left[D_{KL} \left[P(\mathbf z_{t-1} | \mathbf z_t , \mathbf x) || Q(\mathbf z_{t-1} | \mathbf z_t , \boldsymbol \theta_t) \right] \right] \]
The first term is the likelihood of the ground truth images at the end of the latent variable chain - normal, by assumption
The second term is the KL divergence between two normals:
\[ P(\mathbf z_{t-1} | \mathbf z_t , \mathbf z_0) = \mathcal N\left(\frac{1 - \tilde{\beta}_{t-1}}{1 - \tilde{\beta}_{t}} \sqrt{1 - \beta_t} \mathbf z_t + \frac{\sqrt{\tilde{\beta}_{t-1}}\beta_t}{1 - \tilde{\beta}_t} \mathbf z_0 , \frac{\beta_t (1 - \tilde{\beta}_{t-1})}{1 - \tilde{\beta}_t} \mathcal I \right) \]
\[ Q(\mathbf z_{t-1} | \mathbf z_t, \boldsymbol \theta_t) = \mathcal N \left(\mathbf z_{t-1} \mid g(\mathbf z_{t}, \boldsymbol \theta_t) , \sigma^2_t \mathcal I \right) \]
Since the KL divergence is between two normals, we can define an analytical loss function for training a diffusion model:
\[ \begin{align} & \sum \limits_{i = 1}^N - \log \mathcal N(\mathbf x_i | g(\mathbf z_{i1} , \boldsymbol \theta_1) , \sigma^2_1 \mathcal I) + \\ & \sum \limits_{t = 2}^T \frac{1}{2\sigma^2_t} \| \frac{1 - \tilde{\beta}_{t-1}}{1 - \tilde{\beta}_t} \sqrt{1 - \beta_t} \mathbf z_{it} + \frac{\sqrt{\tilde{\beta}_{t-1}} \beta_t}{1 - \tilde{\beta}_t} \mathbf x_i - g_t\left[\mathbf z_{it}, \boldsymbol \theta_t \right]\|^2 \end{align} \]
The first term is the reconstruction term
The second is the distance between the target mean of the conditional and the approximated mean!
This is trainable!
Practical notes:
Diffusion models are often reparameterized in terms of the noise. This makes the model a little easier to train, but loses the theoretical simplicity of the original construction.
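A sketch of that noise-parameterized objective, where `eps_model` stands in for the network: corrupt \(\mathbf x_0\) to a random step with known noise, then score how well the model recovers that noise (the zero-predicting stub is just for illustration):

```python
import numpy as np

def noise_prediction_loss(x0, betas, t, eps_model, rng):
    beta_tilde = np.prod(1.0 - betas[:t])
    eps = rng.standard_normal(x0.shape)            # the noise actually used
    z_t = np.sqrt(beta_tilde) * x0 + np.sqrt(1.0 - beta_tilde) * eps
    eps_hat = eps_model(z_t, t)                    # network's guess at that noise
    return np.mean((eps - eps_hat) ** 2)

rng = np.random.default_rng(0)
loss = noise_prediction_loss(np.zeros(8), np.full(100, 0.02), t=50,
                             eps_model=lambda z, t: np.zeros_like(z), rng=rng)
```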
\[ Q(\mathbf z_{t-1} | \mathbf z_t, \boldsymbol \theta_t) \sim \mathcal N \left(\mathbf z_{t-1} \mid g(\mathbf z_{t}, \boldsymbol \theta_t) , \sigma^2_t \mathcal I \right) \]
Each diffusion step is associated with its own set of parameters!
This means that we should train different neural networks for each diffusion step
In general, this is not feasible
For images, one approach is to create a deep nonlinear mapping between the diffused image at time \(t-1\) and time \(t\)
Learn a set of parameters that will take in an image and return a diffused image of the same size
Image to image prediction
Anyone got a proposal for a method that does that?
So, at each transition we need a full UNet.
Instead, a clever trick:
Model each step as a step in a UNet
But, encode time using positional embeddings
Where have we seen time and sequences arise previously?
A single UNet:
Each block represents a time step in the diffusion process
Treated like a sequential step in a transformer model with self-attention (the encoder side)
Self-attention steps that look across all time points in the diffusion process to figure out how different diffusion steps relate to one another
Time embedding helps to create locational dependencies among the time steps
A little more scalable!
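The time embedding can be the same sinusoidal construction transformers use for positions; a minimal sketch (the frequency base 10000 follows the usual transformer convention):

```python
import numpy as np

def time_embedding(t, dim):
    # Sin/cos features at geometrically spaced frequencies, indexed by step t
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

emb = time_embedding(t=37, dim=64)   # one 64-d vector per diffusion step
```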
But, will still require a lot of time steps to effectively generate realistic looking images.
In general, diffusion models work better when \(T\) is large and \(\beta\) is really close to zero
Think of it as creating a lot of very small noise steps between the original image and the stable point
More small steps mean more places for differences between images
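A common concrete choice is a linear schedule with many small steps (the endpoint values below are typical DDPM-style numbers, not requirements):

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    # Many small mixing values, slowly increasing over T steps
    return np.linspace(beta_start, beta_end, T)

betas = linear_beta_schedule(1000)
signal = np.cumprod(1.0 - betas)   # beta_tilde_t: how much of the image survives
# By the last step essentially none of the original signal remains
```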
Diffusion is a lot slower than other generative models
Any time you see a transformer, you should immediately think that it requires a lot of data and a lot of computational resources!
We normal humans aren’t going to be able to really train good diffusion models
Stable diffusion is a recent diffusion generative model
A collection of a number of different advances in generative and discriminative modeling!
Inputs:
A set of \(N\) images and \(N\) associated text strings/prompts
Multiple parts. All things we’ve seen before!
Step 1: VAE compression
Pass the original image through a VAE encoder to transform the high res image into a smaller, noisier space
The latent variable for the VAE is structured to be another image - instead of a vector, the latent space is a 32x32 version of the original image
A little noisier and smaller than the original input
We know that VAEs work, but create blurry images. Learn what we can from a fast VAE and then let the diffusion model learn the rest!
Step 2: Forward Diffusion
Create a sequence of diffusion steps
Add a little noise at each step
For training, hundreds of steps with very small noise variance!
This part is the least computationally intense.
Step 3: Compute prompt embeddings
Given different prompts, compute an embedding for each prompt.
Most frequently, use a pre-trained text encoder (e.g. BERT or CLIP) with 512 or 1024 dimensional embeddings
Step 4: Denoising UNet
UNet with positional embeddings to translate fully diffused image back to pixel space
Cross-attention with prompt embedding at each step
Self-attention with all diffusion time steps at each step
The money maker:
With enough GPUs, forego the time embedding shortcut and train a separate UNet for each time transition!
Better quality images since there are way more parameters.
We mere mortals can’t replicate this part…
Step 5: VAE decoder
Take the output image from the denoising UNet and decode it using the trained VAE.
This entire process is backpropable
Leads to an absolutely insane level of image quality that largely matches the prompt!
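The five steps can be strung together in a toy training-step sketch. Every component here is a tiny stand-in (the real VAE, text encoder, and UNet are large networks, and all the names are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for each component -- the real ones are large neural networks
vae_encode = lambda img: img[::8, ::8]                  # Step 1: 256x256 -> 32x32 latent
text_encoder = lambda prompt: rng.standard_normal(512)  # Step 3: prompt embedding
unet = lambda z, t, emb: np.zeros_like(z)               # Step 4: denoiser stub (predicts noise)

def train_step(image, prompt, betas):
    latent = vae_encode(image)                          # Step 1: VAE compression
    t = int(rng.integers(1, len(betas) + 1))            # Step 2: forward-diffuse to a random step
    beta_tilde = np.prod(1.0 - betas[:t])
    eps = rng.standard_normal(latent.shape)
    noisy = np.sqrt(beta_tilde) * latent + np.sqrt(1.0 - beta_tilde) * eps
    emb = text_encoder(prompt)                          # Step 3: prompt embedding
    eps_hat = unet(noisy, t, emb)                       # Step 4: UNet, cross-attention with emb
    return np.mean((eps - eps_hat) ** 2)                # differentiable end to end; Step 5 (VAE decode) runs at sampling time

loss = train_step(np.zeros((256, 256)), "a cat", np.full(1000, 0.01))
```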
We probably can’t train our own stable diffusion models.
But, we can use pre-existing ones!
Stable diffusion is the new hotness
Like GPT, it is only trainable by those with the most GPU resources
But, generated image quality is higher than that of GANs
Text to Image conditioning is easier in Stable Diffusion than with Conditional GANs, so more control over what gets sampled.
Transformer style diffusion models seem to work really well
More research is needed to understand why this seems to work so much better than self-attention GANs
GANs and VAEs are way more computationally efficient
This can be your research area…